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Abstract 

This paper proposes a general adaptive procedure for budget-limited predictor design in high di¬ 
mensions called two-stage Sampling, Prediction and Adaptive Regression via Correlation Screening 
(SPARCS). SPARCS can be applied to high dimensional prediction problems in experimental science, 
medicine, finance, and engineering, as illustrated by the following. Suppose one wishes to run a sequence 
of experiments to learn a sparse multivariate predictor of a dependent variable Y (disease prognosis 
for instance) based on a p dimensional set of independent variables X = [A),..., X p ] T (assayed 
biomarkers). Assume that the cost of acquiring the full set of variables X increases linearly in its 
dimension. SPARCS breaks the data collection into two stages in order to achieve an optimal tradeoff 
between sampling cost and predictor performance. In the first stage we collect a few ( n ) expensive 
samples {pi, x,}” =1 , at the full dimension p A n of X, winnowing the number of variables down to 
a smaller dimension l < p using a type of cross-correlation or regression coefficient screening. In the 
second stage we collect a larger number (t — n) of cheaper samples of the l variables that passed the 
screening of the first stage. At the second stage, a low dimensional predictor is constructed by solving the 
standard regression problem using all t samples of the selected variables. SPARCS is an adaptive online 
algorithm that implements false positive control on the selected variables, is well suited to small sample 
sizes, and is scalable to high dimensions. We establish asymptotic bounds for the Familywise Error Rate 

Parts of this work were presented at the 2013 Conference on Artificial Intelligence and Statistics (AISTATS) and at the 2013 
IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP). 

This research was partially supported by the US National Science Foundation under grants CCF-1217880, DMS-CMG- 
1025465, AGS-1003823, DMS-1106642, and DMS-CAREER-1352656, by the US Air Force Office of Scientific Research under 
grant FA9550-13-1-0043, and by the US Army Research Office under grant W91 INF-15-1-0479. 


October 4. 2016 


DRAFT 



2 


(FWER), specify high dimensional convergence rates for support recovery, and establish optimal sample 
allocation rules to the first and second stages. 

Index Terms 

high dimensional regression, predictive modeling, model selection, thresholding, two-stage prediction, 
graphical models. 


I. Introduction 

Much effort has been invested in the sparse regression problem where the objective is to learn a sparse 
linear predictor from training data {t/j, x t \ , x-a ,..., , where the number p of predictor variables is 

much larger that the number n of training samples. Applications in science and engineering where such 
“small n large p” problems arise include: sparse signal reconstruction [11], [15]; channel estimation in 
multiple antenna wireless communications [27], [8]; text processing of internet documents [20], [14]; gene 
expression array analysis [24]; combinatorial chemistry [47]; environmental sciences [45]; and others [26]. 
In this R«p setting training a linear predictor becomes difficult due to rank deficient normal equations, 
overfitting errors, and high computational complexity. 

A large number of methods for solving the sparse regression problem have been proposed. These 
include methods that simultaneously perform variable selection, and predictor design, and the methods 
that perform these two operations separately. The former class of methods includes, for example, least 
absolute shrinkage and selection operator (LASSO), elastic LASSO, and group LASSO [26], [48], [16], 
[9], [57], [21], [10], The latter class of methods includes sequential thresholding approaches such as 
sure independence screening (SIS); and marginal regression [17], [22], [23], [18], All of these methods 
are offline in the sense that they learn the predictor from a batch of precollected samples of all the 
variables. In this paper we propose an online framework, called two-stage Sampling, Prediction and 
Adaptive Regression via Correlation Screening (SPARCS), which unequally and adaptively samples the 
variables in the process of constructing the predictor. One of the principal results of this paper is that, 
as compared under common sampling budget constraints, the proposed SPARCS method results in better 
prediction performance than offline methods. 

Specifically, the SPARCS method for online sparse regression operates in two-stages. The first stage, 
which we refer to as the SPARCS screening stage, collects a small number of full dimensional samples 
and performs variable selection on them. Variable selection at the SPARCS screening stage can be 
performed in one of two ways, i.e., by screening the sample cross-correlation between Y and X, as in 
sure independence screening (SIS), or by thresholding the generalized Ordinary Least Squares (OLS) 
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solution, which we propose in this paper and we refer to as predictive correlation screening (PCS). 
The second stage of SPARCS, referred to as the SPARCS regression stage, collects a larger number of 
reduced dimensional samples, consisting only of the variables selected at the first stage, and regresses 
the responses on the the selected variables to build the predictor. 

We establish the following theoretical results on SPARCS. First, under a sparse correlation assumption, 
we establish a Poisson-like limit theorem for the number of variables that pass the SPARCS screening 
stage as p —> oo for fixed n. This yields a Poisson approximation to the probability of false discoveries 
that is accurate for small n and very large p. The Poisson-like limit theorem also specifies a phase 
transition threshold for the false discovery probability. Second, with n, the number of samples in the first 
stage, and t, the total number of samples, we establish that n needs only be of order logp for SPARCS 
to succeed in recovering the support set of the optimal OLS predictor. Third, given a cost-per-sample 
that is linear in the number of assayed variables, we show that the optimal value of n is on the order of 
logf. The above three results, established for our SPARCS framework, can be compared to theory for 
correlation screening [30], [31], support recovery for multivariate LASSO [41], and optimal exploration 
vs. exploitation allocation in multi-armed bandits [5]. 

SPARCS can of course also be applied offline. When implemented in this way, it can be viewed as 
an alternative to LASSO-type regression methods [48], [42], [50], [33], [52], LASSO based methods 
try to perform simultaneous variable selection and regression via minimizing an l \-regularized Mean 
Squared Error (MSE) objective function. Since the (:) -regularized objective function is not differentiable, 
such an optimization is computationally costly, specially for large p. Several approaches such as LARS 
[16], [35], [32], gradient projection methods [19], [43], interior point methods [37], [38] and active-set- 
type algorithms [36], [55], [56] have been developed to optimize the LASSO objective function. SPARCS 
however differs from LASSO as it does not consider a regularized objective function and does not require 
costly iterative optimization. Instead, it performs variable selection via thresholding the min-norm solution 
to the non-regularized OLS problem. 

Offline implementation of the proposed SPARCS method can be compared with correlation learning, 
also called marginal regression, simple thresholding, and sure independence screening [22], [23], [17], 
wherein the simple sample cross-correlation vector between the response variable and the predictor 
variables is thresholded. The theory developed in this paper also yields phase transitions for the familywise 
false discovery rate for these methods. 

The SPARCS screening stage has some similarity to recently developed correlation screening and hub 
screening in graphical models [30], [31]. However, there are important and fundamental differences. The 
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methods in [30], [31] screen for connectivity in the correlation graph, i.e., they only screen among the 
predictor variables {X \,..., X p \. SPARCS screens for the connections in the bi-partite graph between 
the response variable Y and the predictor variables X \,..., X p . Thus SPARCS is a supervised learning 
method that accounts for Y while the methods of [30], [31] are unsupervised methods. 

SPARCS can also be compared to sequential sampling methods, originating in the pioneering work of 
[51]. This work has continued in various directions such as sequential selection and ranking and adaptive 
sampling schemes [7], [25]. Recent advances include the many multi-stage adaptive support recovery 
methods that have been collectively called distilled sensing [29], [28], [53], [54] in the compressive 
sensing literature. While bearing some similarities, our SPARCS approach differs from distilled sensing 
(DS). Like SPARCS, DS performs initial stage thresholding in order to reduce the number of measured 
valuables in the second stage. However, in distilled sensing the objective is to recover a few variables with 
high mean amplitudes from a larger set of initially measured predictor valuables. In contrast, SPARCS 
seeks to recover a few valuables that are strongly predictive of the response valuable from a large number 
of initially measured predictor variables and the corresponding response valuable. Furthermore, unlike in 
DS, in SPARCS the final predictor uses all the information on selected variables collected during both 
stages. 

The paper is organized as follows. Section II provides a practical motivation for SPARCS from the 
perspective of an experimental design problem in biology. It introduces the under-determined multivariate 
regression problem and formally defines the two stages of the SPARCS algorithm. Section III develops 
high dimensional asymptotic analysis for screening and support recovery performance of SPARCS. 
Section III also provides theory that specifies optimal sample allocation between the two stages of 
SPARCS. Section IV presents simulation comparisons and an application to symptom prediction from 
gene expression data. 

II. Two-stage SPARCS method for online sparse regression 

In this section we motivate the two-stage SPARCS method for online sparse regression via an exper¬ 
imental design problem in biology. Moreover, we formally define each stage of the two-stage SPARCS 
method. 

A. Motivation and definition for SPARCS 

As a practical motivation for SPARCS consider the following sequential design problem that is relevant 
to applications where the cost of samples increases with the number p of valuables. This is often the 
case for example, in gene microarray experiments: a high throughput “full genome” gene chip with 
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Agilent Array Price 



Fig. 1. Price of arrays as a function of the number of probes. The dots represent pricing per slide for Agilent Custom 
Microarrays G2509F, G2514F. G4503A, G4502A (May 2014). The cost increases as a function of probeset size. Source: BMC 
Genomics and RNA Profiling Core. 


p = 40, 000 gene probes can be significantly more costly than a smaller assay that tests fewer than 
p = 15, 000 gene probes (see Fig. 1). In this situation a sensible cost-effective approach would be to use 
a two-stage procedure: first select a smaller number l of variables on a few expensive high throughput 
samples and then construct the predictor on additional cheaper low throughput samples. 

Motivated by the above practical example, we propose SPARCS as the following two-stage procedure. 
The first stage of SPARCS, also referred to as the SPARCS screening stage, performs variable selection 
and the second stage, also referred to as the SPARCS regression stage, constructs a predictor using the 
variables selected at the first stage. More specifically, assume that there are a total of t samples {y,. x,}'■_, 
available. During the first stage a number n < t of these samples are assayed for all p variables and during 
the second stage the rest of the t — n samples are assayed for a subset of I < p of the variables selected in 
the first stage. Variable selection at the SPARCS screening stage can be performed in one of two ways, (1) 
by screening the sample marginal cross-correlation between Y and X, as in sure independence screening 
(SIS), or (2) by thresholding the solution to the generalized Ordinary Least Squares (OLS) problem, 
which we refer to as predictive correlation screening (PCS). Subsequently, the SPARCS regression stage 
uses standard OLS to design a /-variable predictor using all t samples collected during both stages. 

An asymptotic analysis (as the total number of samples t —t oo) of the above two-stage predictor is 
undertaken in Sec. Ill to obtain the optimal sample allocation for stage 1 and stage 2. Assuming that 
a sample of a single variable has unit cost and that the total available budget for all of the samples is 
p, the asymptotic analysis yields minimum Mean Squared Error (MSE) when n, t, p, and k satisfy the 
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budget constraint: 

np + (t — n)k < p, (1) 

where k is the true number of active variables in the underlying linear - model. The condition in (1) is 
relevant in cases where there is a bound on the total sampling cost of the experiment and the cost of a 
sample increases linearly in its dimension p. 

B. SPARCS screening stage 

We start out with some notations. Assume that n i.i.d. paired realizations of X = [X \...., X p ] and Y 
are available, where X is a random vector of predictor variables and Y is a scalar - response variable to 
be predicted. We represent the n x p predictor data matrix as X and the n x 1 response data vector as 
Y. The p x p sample covariance matrix S ;,: for the rows of the data matrix X is defined as: 

1 n 

S x = -- V '(Xj - x) T (xj - x), (2) 

n — 1 ' 

i= 1 

where x,; is the i-th row of data matrix X, and x is the vector average of all n rows of X. We also denote 
the sample variance of the elements of Y as s y . 

Consider the n x (p+ 1) concatenated matrix W = [X, Y]. The sample cross-covariance vector S xy is 
defined as the upper right p x 1 block of the (p + 1) x (p + 1) sample covariance matrix obtained by (2) 
using W as the data matrix instead of X. The p x p sample correlation matrix R x is defined as 

R x = Dg|s x Dgi.; f (3) 

where Da represents a matrix that is obtained by zeroing out all but diagonal entries of A. Moreover, 
the p x 1 sample cross-correlation vector YL xy is defined as: 

R*v = Dgjs xy (s y )~ 2 . (4) 

The SIS method for the SPARCS screening stage selects the desired number of variables, l, by picking 
the l variables that have the largest absolute sample correlation with the response variable Y. Therefore, 
SIS performs support recovery by discovering the entries of R ; ' 7/ whose absolute value is larger than 
some threshold. 

Next we introduce the under-determined ordinary least squares (OLS) multivariate regression problem. 
Assume that n < p. We define the generalized Ordinary Least Squares (OLS) estimator of Y given X 
as the min-norm solution of the under-determined least squares regression problem 

min IIY-XB^IlL (5) 
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where j A|| /? represents the Frobenius norm of matrix A. The min-norm solution to (5) is the vector of 
regression coefficients 

B xy = (S x ) t S* w , (6) 

where A ' denotes the Moore-Penrose pseudo-inverse of the matrix A. If the i-th entry of the regression 
coefficient vector B xy is zero then the i-th predictor variable is not included in the OLS estimator. This 
is the main motivation for the PCS method for variable selection at the SPARCS screening stage. More 
specifically, the PCS method selects the l entries of B xy having the largest absolute values. Equivalently, 
PCS performs support recovery by discovering the entries of the generalized OLS solution B xy whose 
absolute value is larger than some threshold. 

In Sec. III-C we will see that, under certain assumptions, SIS and PCS admit similar asymptotic 
support recovery guarantees. However, our experimental results in Sec. IV show that for n <C p, if SIS 
(or LASSO) is used instead of PCS in the SPARCS screening stage, the performance of the two-stage 
predictor suffers. This empirical observation suggests that pre-multiplication of S xy by the pseudo-inverse 
(S x y [ instead of by the diagonal matrix Dg^ 2 , can improve the performance of the SPARCS procedure. 

C. SPARCS regression stage 

In the second stage of SPARCS, a number t, — n of additional samples arc collected for the l < p 
variables found by the SPARCS screening stage. Subsequently, a sparse OLS predictor of Y is constructed 
using only the l variables selected at the SPARCS screening stage. Specifically, the predictor coefficients 
are determined from all of the t samples according to 

(s?i))- ls <?’ (7) 

where and S xy are the I x I sample covariance matrix and the l x 1 sample cross-covariance vector 
obtained for the set of l variables selected by the SPARCS screening stage. 

In Sec. Ill we establish high dimensional statistical convergence rates for the two stage online SPARCS 
procedure and we obtain asymptotically optimal sample allocation proportions n/t and (t — n)/t for the 
first and second stage. 


III. Asymptotic analysis 

A. Notations and assumptions 

In this section we introduce some additional notation and state the required assumptions for our 
asymptotic statistical analysis of SPARCS. 
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The following notations are required for the propositions in this section. The surface area of the (n —2)- 
dimensional unit sphere S n -2 in M n_1 is denoted by by a n . In the sequel we often refer to a vector on 
S n -2 as a unit norm vector. 

Our statistical analysis of SPARCS uses the U-score representations of the data. More specifically, 
there exist a (n-l)xp matrix IF with unit norm columns, and a (n — 1) x 1 unit norm vector ] U y such 
that the following representations hold [30], [31]: 

R x = (8 ) 

and 

K xy = (IF) T U 2 '. (9) 

Specifically, the columns of the matrices IF and IF in the above representations are called U-scores. U- 
scores lie on the (n — 2)-sphere S n -2 in R n_1 and arc constructed by projecting away the component of 
the Z-scores that are orthogonal to the n — 1 dimensional hyperplane {u <E R n : l 7 u = 0}. The sample 
correlation between X, and X :j can be computed using the inner product or the Euclidean distance 
between associated U-scores: 

HU* — U x || 2 

r?j = (Uf) T U x = 1 - * 2 J " 2 - (10) 

Similarly, the sample correlation between X, and Y can be computed as: 

if = (uf) T u y = 1 - K ( 11 ) 

More details about the U-scores representations can be found in [30], [31] and in the Appendix. 

Assume that U,V arc two independent and uniformly distributed random vectors on S n - 2 . For a 
threshold p £ [0,1], let r = \/‘2(l — p). Po(p, n) is then defined as the probability that either 11U — V11 2 < 
r or ||U + V|| 2 v . ?r) can be computed using the foimula foi the aiea of sphencal caps on S^i — 2 

(cf. [40]): 

P 0 = Ii_ p 2 ((n- 2)/2,l/2), (12) 

in which I x (a,b) is the regularized incomplete beta function. 

S C {1,..., p) denotes the set of indices of the variables selected by the SPARCS screening stage. 
Moreover, l refers to the number of variables selected at the SPARCS serening stage, i.e., | S\ = l. 

For the asymptotic analysis we assume that the response Y is generated from the following statistical 
model: 

T = oqAji + at 2 Xi 2 + • • • + a,i k Xi k + N, (13) 
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where 7To = {*i, ■ ■ • , ik } is a set of distinct indices in {1,.., ,p}, X = [X\, X 2 , • • • , X p ] is the vector of 
predictors, Y is the response variable, and N is a noise variable. X t[ , • • • , X lk arc called active variables 
and the remaining p — k variables arc called inactive variables. In the sequel, we refer to the set 7To as 
the support set, and 17To | = k denotes the number of active variables. 

Unless otherwise specified, throughout this paper we consider random data matrices X that satisfy the 
following: for every e > 0 there exist a constant C > 0 such that the following concentration property 
holds: 

P( ||Ds* - DeJ| > e ) < exp(-Cn), (14) 

in which ||A|| is the operator norm of A, S'’ is the sample covariance matrix defined in (2), and X x is 
the population covariance matrix. Property (14) is similar to, but weaker than, the concentration property 
introduced in [17] as it only implies bounds on the joint convergence rate of the diagonal entries of 
the sample covariance matrix (i.e., sample variances of the predictor variables X], - ■ ■ , X p ) and does 
not imply bounds on the convergence of the off-diagonal entries of the sample covariance matrix (i.e., 
sample cross covariances of the predictor variables X\. ■ ■ ■ . X p ). It is known that the concentration 
property (14) holds when the predictors X follow a p-variate distribution with sub-Gaussian tails [39]. 
It is worth mentioning that the concentration property (14) is also satisfied when the linear model (13) 
is assumed on the standardized observations for which D$ = Dj = I p . 

In our asymptotic analysis of SPARCS we make the following additional assumptions on the linear 
model (13), which arc comparable or weaker than assumptions made in other studies [41], [17], [12], 
[49], [13]. 

Assumption 1: The n x p data matrix X follows a multivariate elliptically contoured distribution with 
mean p, x and p x p dispersion matrix X x , i.e. the probability density function (pdf) is of the form 
/x(X) = g(tr ((X — l/r x )S“ 1 (X — 1 p x ) T )^J, where g is a non-negative function and tr(A) is the trace 
of A. Moreover, the density function /x(-) is bounded and differentiable. 

Assumption 2: Let p y i represent the true correlation coefficient between response variable Y and 
predictor variable X % . The quantity 

Pmin = min {I Pyi \ ~ I Pyj I} > (15) 

ie7T 0 je{l,--- ,p}\7To 

is strictly positive and independent of p. 

Assumption 3: The (n — 1) x p matrix of U-scores satisfies (with prob. 1): 

——^U E (U a; ) ,r = I n _i -fo(l), as p —> oo, (16) 

P 

in which o(l) is a (n — 1) x (n — 1) matrix whose entties are o(l). 
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Assumption 1 is weaker than the Gaussian assumption commonly used in compressive sensing [28], [6] 
and, unlike standard sub-Gaussian assumptions commonly used in high dimensional data analysis [10], 
allows for heavy tails. Assumption 2 is a common assumption that one finds in performance analysis of 
support recovery algorithms (cf. [41], [17]). In particular, Assumption 2 can be compared to the conditions 
on the sparsity-overlap function in [41] which impose assumptions on the population covariance matrix in 
relation to the true regression coefficients. Assumption 2 can also be compared to Condition 3 introduced 
in [17] that imposes lower bounds on the magnitudes of the true regression coefficients as well as on 
the true correlation coefficients between predictors and the response. Assumption 3 can be related to 
assumptions (A1)-(A3) in [41] in the sense that they both lead to regularity conditions on the entries and 
the eigenspectrum of the correlation matrix. Assumption 3 is also similar - to the concentration property 
introduced in [17] as they both yield regularity conditions on the inner products of the rows of the 
data matrix. Moreover, Assumption 3 can also be considered as an incoherence-type condition on the 
U-scores, similar - to the incoherence conditions on the design matrix assumed in the compressive sensing 
literature [12], [49], [13]. It is worth mentioning that a special case in which Assumption 3 is satisfied 
is the orthogonal setting where XX T /n = I n . 

Lemma 1 below specifies a class of p x p correlation matrices f l x for which Assumption 3 is satisfied. 

— 1/2 — 1/2 

Lemma 1: Assume that the population correlation matrix Q x = D s ~X/,D S is of the following 
weakly block-sparse form 

n x = n bs + n e , (17) 

in which Q bs is a px p block-sparse matrix of degree d x (i.e., by re-arranging rows and columns of Ll bs 
all non-zero off-diagonal entries can be collected in a d x x d x block), and il, = [<^ij\i<i,j<p is a p x p 
matrix such that u l3 = O ( f(\i — j|)) for some function /(.) with lim^oo /(f) = 0. If d x = o(p), then 
Assumption 3 holds. 

Proof of Lemma 1: See Appendix. □ 

Note that Lemma 1 is essentially a result of the application of the law of large numbers to the inner 
product of the rows of the U-score matrix ILF. More specifically, due to specific decomposition (17) for 
the correlation matrix O x , as p —> oo, the inner product of two different rows of U x converges to 0, 
as the proportion of the terms that are obtained by multiplication of significantly correlated variables 
converges to zero. 
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B. High dimensional asymptotic analysis for screening 

In this section, we establish a Poisson-Iike limit theorem for the number of variables that pass the 
SPARCS screening stage as p —> oo for fixed n. This yields a Poisson approximation to the probability 
of false discoveries that is accurate for small n and large p. The Poisson-like limit theorem also specifies 
a phase transition threshold for the false discovery probability. 

Lemma below states that the PCS method can be interpreted as a method for discovering the non-zero 
entries of a p x 1 vector with a special representation, by thresholding the entries at some threshold p. 
It is worth noting that a si mi lar result also holds true for SIS without Assumption 3. 

Lemma 2: Under Assumptions 1 and 3, the PCS algorithm for support recovery is asymptotically 
equivalent to thresholding the entries of a p x 1 vector <f> xy which admits the following representation: 

$ xy = (Z x ) T Z y , (18) 

in which 7 X is a (n — 1) x p matrix whose columns arc unit norm vectors, and 7 y is a (n — 1) x 1 unit 
norm vector. 

Proof of Lemma 2: See Appendix. □ 

For a threshold p 6 [0,1], let N p y denote the number of entries of a p x 1 vector of the form (18) 
whose magnitude is at least p. The following proposition gives an asymptotic expression for the expected 
number of discoveries E [N p y ], for fixed n, as p -A oo and p —> 1. It also states that under certain 
assumptions, the probability of having at least one discovery converges to a given limit. This limit is 
equal to the probability that a certain Poisson random variable N* with rate equal to lim p _ 5 . 0Oi/9 _ 5 .i E [N p y ] 
satisfies: N* > 0. The following proposition does not need the concentration property (14) to hold. 

Proposition 1: Consider the linear model (13). Let { p p } p be a sequence of threshold values in [0,1] 
such that p p —>• 1 as p —> oo and p( 1 — p^ !n ~ 2,/ ' 2 6n Under the Assumptions 1 and 3, if the number 
of active variables k grows at a rate slower than p, i.e., k = o(p), then for the number of discoveries 
K y v we have: 

lim E [N xy ] = lim f p ^ Pp = C„, (19) 

p —TOO p —TOO p 

where C P ,n, Pp = pPo{p,n) and ( n = e n a n /(n - 2). Moreover: 

lim P (N xy > 0) = 1 - exp(Cn)- (20) 

p —S'-OO iv 

Proof of Proposition 1: See Appendix. □ 
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Part x 
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Fig. 2. The first stage of SPARCS is equivalent to discovering the non-zero entries of the p x 1 vector $> xy in (18) to find 
variables X. t that are most predictive of the response Y. This is equivalent to finding sparsity in a bipartite graph G p (& xy ) with 
parts x and y which have vertices {A'i, ..., X p } and Y, respectively. For 1 < i < p, vertex Xi in part x is connected to vertex 
Y in part y if \(j> xy \ > p. 


Note also that Prop. 1 can be generalized to the case where Assumption 3 is not required. However when 
Assumption 3 is removed the asymptotic rates for E [N‘p y ] and P(Np y > 0) depend on the underlying 
distribution of the data. Such a generalization of Prop. 1 is given in the Appendix. 

Proposition 1 plays an important role in identifying phase transitions and in approximating p-values 
associated with individual predictor variables. More specifically, under the assumptions of Prop. 1: 

m x p y p > 0) ->• 1 - exp(-£p, n ,p p ) as p -> oo. (21) 

The above limit provides an approach for calculating approximate p-values in the setting where the 
dimension p is very large. For a threshold p e [0,1] define G p (<f> xy ) as the undirected bipartite graph 
(Fig. 2) with parts labeled x and y, and vertices {X \, X- 2 ,..., X p } in part x and Y in part y. For 1 < i < p, 
vertices X, t and Y arc connected if \(jp y \ > p, where (j\ y is the i-th entry of <F ;:z:y defined in (18). Denote 
by df the degree of vertex X, in Q p (& xy ). Note that df € {0,1}. For each 1 < i < p, denote by 
p(i) the maximum value of the threshold p for which df = 1 in Q fI (& :ry ). By this definition, we have 
p(i) = \(t> x i y \- Using Prop. 1 the /;-value associated with predictor variable X t can now be approximated 
as: 

pv{i) « 1 - exp(-^ n;p(i) ). (22) 

Similar - to the result in [30], [31], there is a phase transition in the p-values as a function of the 
threshold p. More exactly, there is a critical threshold p c such that if p > p c , the average number E [Np V ] 
of discoveries abruptly decreases to 0 and if p < p c the average number of discoveries abruptly increases 
to p. Motivated by this, we define the critical threshold p c as the threshold that satisfies the equation 
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dK[Np V ]/dp = —p. Using (19), the value of the critical threshold can be approximated as: 

p c = y/l - (a n p)-2/(n- 4 ). (23) 

Note that the expression given in (23) bears resemblance to the expression (3.14) in [30]. Expression 
(23) is useful in choosing the screening threshold p. Selecting p slightly greater than p c will prevent the 
bipartite graph Q p (<f> xy ) from having an overwhelming number of edges. 

C. High dimensional asymptotic analysis for support recovery 

In this section we give theoretical upper bounds on the Family-Wise Error Rate (FWER) when 
performing variable selection in SPARCS screening stage. 

Propositions 2 and 3 give upper bounds on the probability of selection error for the SPARCS screening 
stage by thresholding the vector (i.e. using SIS), or the vector H xy (i.e. using PCS), respectively. 

Proposition 2: Let S denote the support set selected using SIS and let l = |Sj be the size of this 
support. Under Assumptions 1 and 2, if n > 0(logp) then for any l > k, SIS recovers the support 7To, 
with probability at least 1 — 1 /p, i.e. 

IP (tto c 5) > 1 - 1 /p. (24) 

Proof of Proposition 2: See Appendix. □ 

Proposition 3: Let S denote the support set selected using PCS and let 1 = | .S' be the size of this 
support. Under Assumptions 1-3, if n > 0(logp) then for any l > k, PCS recovers the support -kq, with 
probability at least 1 — 1/p, i.e. 

p (tto c 5) > 1 - 1/p. (25) 

Proof of Proposition 3: See Appendix. 

□ 

The constant in 0(log p) of Prop. 2 and Prop. 3 is increasing in p nnn . It is shown in the proof of the 
propositions that 12/p m \ n is an upper bound for the constant in 0(logp). Note that the above propositions 
on support recovery allow all types of non-zero correlations (i.e., correlations between active variables, 
correlations between inactive variables, and correlations between active and inactive variables) as long 
as the corresponding assumptions are satisfied. 

Propositions 2 and 3 can be compared to Thm. 2 in [41] and Thm. 1 in [17] for recovering the 
support set 7To- More specifically, Thm. 2 in [41] asserts a similar - result as in Prop. 2 and Prop. 3 for 
support recovery via minimizing a LASSO-type objective function. Also Thm. 1 in [17] asserts that if 
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n = 0((logp)“) for some a > 1, SIS recovers the true support with probability no less than 1 — 1/p. 
Note also that Prop. 2 and Prop. 3 state stronger results than the similar results proven in [17] and in 
[41], respectively, in the sense that the support recovery guarantees presented in [17], [41] arc proven 
for the class of multivariate Gaussian distributions whereas Prop. 2 and Prop. 3 consider the larger class 
of multivariate elliptically contoured distributions. These distributions accommodate heavy tails. 

D. High dimensional asymptotic analysis for prediction 

The following proposition states the optimal sample allocation rule for the two-stage SPARCS predictor, 
in order to minimize the expected MSE as t —> oo. 

Proposition 4: The optimal sample allocation rule for the SPARCS online procedure introduced in 
Sec. II under the cost condition (1) is 

f O(logf), c{p - k)\ogt + kt < p 

n = < (26) 

I 0, o.w. 

where c is a positive constant that is independent of p. 

Proof of Proposition 4: See Appendix. □ 

The constant c above is an increasing function of the quantity p m ; n defined in (15). Proposition 4 asserts 
that for a generous budget (p large) the optimal first stage sampling allocation is (9(log t). However, when 
the budget is tight it is better to skip stage 1 (n = 0). Figure 3 illustrates the allocation region (for c = 1) 
as a function of the sparsity coefficient p = l — k/p. Note that Prop. 4 is generally true for any two-stage 
predictor which at the first stage, uses a support recovery method that satisfies the performance bound 
proposed by Prop. 2 or Prop. 3, and at the second stage uses OLS. 

IV. Numerical comparisons 

We now present experimental results which demonstrate the performance of SPARCS when applied 
to both synthetic and real world data. Throughout this section we refer to the SPARCS predictors which 
use SIS or PCS at the first stage as SIS-SPARCS or PCS-SPARCS, respectively. 

a) Efficiency of SPARCS screening stage. We illustrate the performance of the SPARCS screening stage 
(i.e., the first stage of the SPARCS predictor) using SIS or PCS and compare these to LASSO [48], [23]. 

In the first set of simulations we generated an n x p data matrix X with independent rows, each of 
which is drawn from a p-dimensional multivariate normal distribution with mean 0 and block-sparse 
covariance matrix satisfying (17). The p x 1 coefficient vector a is then generated such that exactly 100 
entries of a 6 K p arc active. Each active entry of a is an independent draw from AA(0,1) distribution, 
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Fig. 3. (Left) Surface p/p = cplogf + (1 — p)t, for c = 1. (Right) Contours indicating optimal allocation regions for p/p = 30 
and p/p = 60 (p = 1 — k/p ). As the coefficient c increases, the surface cplogf + (1 — p)f moves upward and the regions 
corresponding to n = O(logf) and n = 0, become smaller and larger, respectively. 


and each inactive entry of a is zero. Finally, a synthetic response vector Y is generated by a simple linear 
model 

Y = Xa + N, (27) 

where N is nx 1 noise vector whose entries are i.i.d. J\T(0, 0.05). The importance of a variable is measured 
by the magnitude of the corresponding entry of a. 

We implemented LASSO on the above data set using an active set type algorithm - asserted to be one 
the fastest methods for solving LASSO [36]. In all of our implementations of LASSO, the regularization 
parameter is tuned to minimize prediction MSE using 2-fold cross validation. To illustrate SPARCS 
screening stage for a truly high dimensional example, we set p = 10000 and compared SIS and PCS 
methods with LASSO, for a small number of samples. Figure 4 shows the results of this simulation 
over an average of 400 independent experiments for each value of n. As we see for small number of 
samples, PCS and SIS methods perform significantly better in selecting the important predictor variables. 
Moreover, the advantage of the extra pseudo-inverse factor used for variable selection in PCS as compared 
to SIS is evident in Fig. 4. 

b) Efficiency of the SPARCS predictor. To test the efficiency of the proposed SPARCS predictor, a total 
of t samples are generated using the linear model (27) from which n = 25 log t are used for the task of 
variable selection at the first stage. All t samples are then used to compute the OLS estimator restricted 
to the selected variables. We chose t such that n = (130 : 10 : 200). The performance is evaluated by 
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t-i:i r: 
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— LASSO 
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n 


Fig. 4. Average number of mis-selected variables. Active set implementation of LASSO (red-dashed) vs. SIS (green-dashed) 
vs. PCS (solid), p = 10000. The data is generated via model (27). The regularization parameter of LASSO is set using 2-fold 
cross validation. It is evident that PCS has a lower miss-selection error compared to SIS and LASSO. 

the empirical Root Mean Squared Error 


m 



(28) 


where m is the number of simulation trials. Similar to the previous experiment, exactly 100 entries 


of a are active and the predictor variables follow a multivariate normal distribution with mean 0 and 


block-sparse covariance matrix. Figure 5 shows the result of this simulation for p = 10000, in terms of 
performance (left) and running time (right). Each point on these plots is an average of 1000 independent 
experiments. Observe that in this low sample regime, when LASSO or SIS are used instead of PCS in the 
first stage, the performance suffers. More specifically we observe that the RMSE of the PCS-SPARCS 
predictor is uniformly lower than the SIS-SPARCS predictor or the two-stage predictor that uses LASSO 
in the first stage. Table I shows the p-values of one-sided paired t-tests testing for differences between 
the RMSE for PCS-SPARCS and SIS-SPARCS (LASSO) for several different values of n. These results 
show the high statistical significances of these RMSE differences. 

To further indicate the advantage of the PCS-SPARCS predictor compared to the SIS-SPARCS pre¬ 
dictor, we performed simulations in which the number of samples used at the first stage, n = 500, and 
the number of samples used at the second stage, t = 2000, are fixed while the number of variables p 
increases from p = 1000 to p = 100000. Moreover, exactly 100 entries of the coefficient vector a are 
active. Similar to the previous experiments, samples are generated using the linear model (27). However, 
in order to generate a data set with high multicollinearity, a scenario that is likely to happen in high 
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-PCS-SPARCS 
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n = 25 log(t) 
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Fig. 5. (Left) Prediction RMSE for the two-stage predictor when n = 25 log £ samples are used for screening at the first 
stage and all £ samples are used for computing the OLS estimator coefficients at the second stage. The solid plot shows the 
RMSE for PCS-SPARCS while the green and red dashed plots show the RMSE for SIS-SPARCS and LASSO, respectively. 
Here, p = 10000. The Oracle OLS (not shown), which is the OLS predictor constructed on the true support set, has average 
RMSE performance that is a factor of 2 lower than the curves shown in the figure. This is due to the relatively small sample 
size available to these algorithms. (Right) Average running time as a function of n for the experiment of the plot on the left. It 
is evident that due to lower computational complexity, SIS-SPARCS and PCS-SPARCS run an order of magnitude faster than 
LASSO. 


n 

130 

140 

150 

160 

170 

180 

190 

200 

PCS-SPARCS vs. SIS-SPARCS 

7.7 x 10“ 3 

6.7 x 10“ 09 

3.2 x 10 -11 

2.4 x 10“ 22 

7.8 x 10“ 29 

8.1 x 10“ 36 

9.2 x 10“ 42 

5.3 x 10 -46 

PCS-SPARCS vs. LASSO 

3.1 x 10 -4 

8.0 x IQ- 10 

7.2 x 10 -14 

3.0 x 10 -25 

1.8 x 10 _3 ° 

5.6 x 10 -39 

1.1 x 10 -42 

6.5 x 10 -48 


TABLE I 

p- VALUES OF THE ONE-SIDED PAIRED T-TEST FOR TESTING THE NULL HYPOTHESIS Ho: PCS-SPARCS AND SIS-SPARCS 
(LASSO) HAVE THE SAME AVERAGE PREDICTION RMSE IN THE EXPERIMENT CORRESPONDING TO FlG 5. SMALL 
p- VALUES SUGGEST THAT PCS-SPARCS SIGNIFICANTLY OUTPERFORMS THE OTHERS. 


dimensional data sets (see [44] and the references therein), here the inactive variables are consecutive 
samples of an Auto-Regressive (AR) process of the form: 

W( l) = e(l), 

W(i) = (j)W{i — 1) + e(i), i = 2, ... ,p — 100, (29) 

in which e(z)’s are independent draws of Af(0,1). The result of this experiment for <j> = 0.99 is shown in 
Fig. 6 (left). The average RMSE values are computed using 1000 independent experiments. The advantage 
of using PCS-SPARCS over SIS-SPARCS is evident in Fig. 6 (left). Note that as the number of variables 
p becomes significantly larger than the number of samples n, the performance of both of the predictors 
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converge to the performance of a random selection and estimation scheme in which variables are selected 
at random in the first stage. 

Furthermore, to analyze the performance of PCS-SPARCS and SIS-SPARCS for different levels of 
multicollinearity in the data, we performed similar experiments for p = [1000, 5000,10000] as the value 
of i ft increases from 0.9 to 0.999. Figure 6 (right) shows the result of this simulation. Each point on these 
plots is the average of 500 independent experiments. It is evident that similar to the previous experiment, 
the PCS-SPARCS predictor outperforms the SIS-SPARCS predictor. An interesting observation in Fig 
6 (right) is that as the multicollinearity coefficient — log 10 (l — ri) increases the performance of the 
PCS-SPARCS predictor improves. 




Fig. 6. (Left) Prediction RMSE for the two-stage predictor when n = 500 samples are used at the first stage, and a total 
of t = 2000 samples are used at the second stage. The number of variables varies from p = 1000 to p = 100000. In this 
experiment, inactive variables are generated via realizations of an Auto-Regressive process of the form (29) with <j> = 0.99 
(— log 10 (1 — (/>) = 2). The solid and dashed plots show the RMSE for PCS-SPARCS and SIS-SPARCS, respectively. The 
plots show the advantage of using PCS instead of SIS at the SPARCS screening stage. (Right) Prediction RMSE as function of 
the multicollinearity coefficient - log 10 (l - <j>) for p = [1000, 5000,10000], For both PCS-SPARCS (solid) and SIS-SPARCS 
(dashed) predictors, the plots with square, triangle and circle markers correspond to p = 10000, p = 5000 and p = 1000, 
respectively. These plots show that the PCS-SPARCS predictor uniformly outperforms the SIS-SPARCS predictor. Observe also 
that as the multicollinearity coefficient — log 10 (l — <j>) increases the performance of the PCS-SPARCS predictor improves. 


c) Estimation ofFWER using Monte Carlo simulation. We set p = 1000, k = 10, n = [100, 200,..., 1000] 
and using Monte Carlo simulation, we computed the probability of support recovery error for the PCS 
method. In order to prevent the coefficients a,j,j <E 7To from getting close to zero, the active coefficients 
were generated via a Bernoulli-Gaussian distribution of the form: 


a ~ 0.5A/'(1, cr 2 ) + 0.5AA(—1, a 2 ), 


(30) 
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Figure 7 shows the estimated probabilities. Each point of the plot is an average of N = 10 4 experiments. 
As the value of o decreases the quantity p m - m defined in (15) is bounded away from 0 with high probability 
and the probability of selection error degrades. As we can see, the FWER decreases at least exponentially 
with the number of samples. This behavior is consistent with the result in Prop. 3. 



Fig. 7. Probability of selection error as a function of number of samples for PCS. Probability of selection error is calculated 
as the ratio of the number of experiments in which the exact support is not recovered over the total number of experiments. The 
entries of the coefficient matrix are i.i.d. draws from distribution (30). Observe that the probability of selection error decreases 
at least exponentially with the number of samples. This behavior is consistent with Prop. 3. 

d) Application to experimental data. We illustrate the proposed SPARCS predictor on the Predictive 
Health and Disease data set, which consists of gene expression levels and symptom scores of 38 
different subjects. The data was collected during a challenge study for which some subjects become 
symptomatically ill with the H3N2 flu virus [34]. For each subject, the gene expression levels (for 
p = 12023 genes) and the clinical symptoms have been recorded at a large number of time points that 
include pre-inoculation and post-inoculation sample times. Ten different symptom scores were measured. 
Each symptom score takes an integer value from 0 to 4, which measures the severity of that symptom 
at the corresponding time. The goal here is to learn a predictor that can accurately predict the future 
symptom scores of a subject based on her last measured gene expression levels. 

We considered each symptom as a scalar response variable and applied the SPARCS predictor to each 
symptom separately. In order to do the prediction task, the data used for the SPARCS predictor consists of 
the samples of the symptom scores for various subjects at 4 specified time points (t\. t' 2 , f. 3 , ti) and their 
corresponding gene expression levels measured at the previous time points (t\ — 1 , A ~ 1 , t:\ — 1 , £4 — 1 ). 
The number of predictor variables (genes) selected in the first stage is restricted to 100. Since, the 
symptom scores take integer values, the second stage uses multinomial logistic regression instead of the 
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Symptom 

RMSE: LASSO 

RMSE: SIS-SPARCS 

RMSE: PCS-SPARCS 

Runny Nose 

0.7182 

0.6896 

0.6559 

Stuffy Nose 

0.9242 

0.7787 

0.8383 

Sneezing 

0.7453 

0.6201 

0.6037 

Sore Throat 

0.8235 

0.7202 

0.5965 

Earache 

0.2896 

0.3226 

0.3226 

Malaise 

1.0009 

0.7566 

0.9125 

Cough 

0.5879 

0.7505 

0.5564 

Shortness of Breath 

0.4361 

0.5206 

0.4022 

Headache 

0.7896 

0.7500 

0.6671 

Myalgia 

0.6372 

0.5539 

0.4610 

Average for all symptoms 

0.6953 

0.6463 

0.6016 


TABLE II 

RMSE OF THE TWO-STAGE LASSO PREDICTOR, THE SIS-SPARCS PREDICTOR AND THE PCS-SPARCS PREDICTOR USED 
FOR SYMPTOM SCORE PREDICTION. THE DATA COME FROM A CHALLENGE STUDY EXPERIMENT THAT COLLECTED GENE 
EXPRESSION AND SYMPTOM DATA FROM HUMAN SUBJECTS [34], LEAVE-ONE-OUT CROSS VALIDATION IS USED TO 

COMPUTE THE RMSE VALUES. 


OLS predictor. Maximum likelihood estimation is used for computing the multinomial logistic regression 
coefficients [1]. The performance is evaluated by leave-one-out cross validation. To do this, the data from 
all except one subject are used as training samples and the data from the remaining subject arc used as 
the test samples. The final RMSE is then computed as the average over the 38 different leave-one-out 
cross validation trials. In each of the experiments 18 out of the 37 subjects of the training set, are used 
in first stage and all of the 37 subjects are used in the second stage. It is notable that PCS-SPARCS 
performs better in predicting the symptom scores for 7 of the 10 symptoms whereas SIS-SPARCS and 
LASSO perform better in predicting the symptom scores for 2 symptoms and 1 symptom, respectively. 

V. Conclusion 

We proposed an online procedure for budget-limited predictor design in high dimensions dubbed two- 
stage Sampling, Prediction and Adaptive Regression via Correlation Screening (SPARCS). SPARCS is 
specifically useful in cases where n«p and the high cost of assaying all predictor variables justifies 
a two-stage design: high throughput variable selection followed by predictor construction using fewer 
selected variables. We established high dimensional false discovery rates, support recovery guarantees, and 
optimal stage-wise sample allocation rule associated with the SPARCS online procedure. Simulation and 
experimental results showed advantages of SPARCS as compared to LASSO. Our future work includes 
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using SPARCS in a multi-stage framework. We believe that multi-stage SPARCS can further improve the 
performance of the algorithm while benefiting from high computational efficiency. 

VI. Appendix 

This section contains three subsections. Section VI-A provides the proof of Lemma 2. Section VI-B 
introduces the necessary notations for the proofs of the remaining propositions. Section VI-C gives the 
proofs for the propositions presented in Sec. III. 


A. Lemma 2 and U-score representations 

Below we present the proof of Lemma 2 which states that both SIS and PCS methods for discovering 
the support are equivalent to discovering the non-zero entries of some px 1 vector & xy with representation 
(18) by thresholding at a specified threshold. 

Proof of Lemma 2: Using the U-score representation of the correlation matrices, there exist a (n — 1) x p 
matrix ILF with unit norm columns, and a (n - 1) x 1 unit norm vector U y such that [30], [31]: 

K xy = (LLF) r UF (31) 


Representation (31) immediately shows that SIS is equivalent to discovering non-zero entries of a vector 
with representation (18). Moreover, we have 

S xy = D| x (U a: ) T U 2/ (s y )2, (32) 


and: 

(S*)t = D s J((U*) T (U*(U x ) T )- 2 U x )D s j, (33) 

where Da denotes the diagonal matrix obtained by zeroing out the off-diagonals of square matrix A. 
We refer the interested reader to [31], [2] for more information about the calculations of U-scores. Using 
representations (32) and (33), one can write: 


Defining LF 


Y = {{S x ^S xy ) T X 

= (s y )2 (U 2/ ) r (U x (U a; ) T ) _1 U :c Dg|x. 


(IF(ILF) T ) 1 U a; D (u 5 a:)T(Ux(u , )T) _ 2Ux , we have: 


Y = 

= (s«)s( H^) r D 


n 2 v 

(U 33 ) T (HJ 33 (HJ X ) t )- 2 U ;e ' L, S x ^ 


TV 2 v 

(U X ) T (U X (U X ) T )- 2 U X s* 


(34) 


(35) 
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where 

H xy = (t) x ) T U y . (36) 

Note that the columns of the matrix LP lie on ,S ' n _2 since the diagonal entries of the p x p matrix 
(ILP : ) T, IIJ a: arc equal to one. Therefore, a U-score representation of the generalized OLS solution ~B xy can 
be obtained as: 

■Qxy _ (gzitgzy 

= D siD ( V ) x (u , (0 , r| -, u ,H*»(s>')‘, (37) 

Without loss of generality we can consider the case where = I p . Given the concentration property 
(14), asymptotically we have D$ —> Dj; = I p , with probability 1. Moreover Assumption 3 yields the 
asymptotic relationship D (U x)T( U: r( U x\Tj- 2 U x = (n — l) 2 /p 2 I p . Therefore, finding the largest entries of 
B T?y is equivalent to finding the largest entries of H' r;y as the ordering of the entries will asymptotically 
stay unchanged. This motivates screening for non-zero entries of the vector tt xy instead of the entries 
of In particular, for a threshold p e [0,1], we can undertake variable selection by discovering 
the entries of the vector H xy in (36) that have absolute values at least p. This implies that discovering 
the support via PCS is equivalent to discovering the non-zero entries of H ,7:y in (36) which admits the 

representation (18). The proof for SIS follows similarly. □ 


B. Notations and preliminaries 


The following additional notations arc necessary for the remaining propositions and the proofs presented 
in this section. 

For arbitrary joint densities /uf,u» (u, v), 1 < i < p defined on the Cartesian product S n - 2 x S n - 2 , 
define 

_ 1 p 

/uj,u»(u, v) = — /u?,u»(su,fv). (38) 

** i 1 S,ie{0,l} 


The quantity /uj,Uw(w v ) is key in determining the expected number of discoveries in screening the 
entries of the vector $> xy in (18). 

In the following propositions, q represents an upper bound on the number of entries in any row or 


column of covariance matrix or cross-covariance vector T, xy that do not converge to zero as p 
We define ||Ap^ )(? ||i, the average dependency coefficient, as: 


lATJli = Tew 

j=i 


00 . 


(39) 
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with 


A x p y n , q {i) = 


( fvf,Vy\V AqW ~ /uf,U»)//u?,U» 


(40) 


in which A q (i) is defined as the set complement of indices of the (/-nearest neighbors of Uf (i.e. the 
complement of indices of the q entries with largest magnitude in the i-th row of X x ). Finally, the function 
J of the joint density /u.v(u, v) is defined as: 

J(fvy) = \S n -2\ [ /u,v(w, w)dw. (41) 

The function J(/u,v) plays a key role in the asymptotic expression for the mean number of discoveries. 
Note that when observations arc independent, by symmetry, the marginal distributions of (/-scores are 
exchangeable, i.e., 


/u(u) = /u(nu) and / V (v) = / v (IIv), 


(42) 


for any (n — 1) x (n — 1) permutation matrix II. Therefore, the joint distribution /u,v must yield 
exchangeable marginals. 

We now present two examples for which J(/u,v) has a closed form expression. 

Example 1. If the joint distribution /u.v is uniform over the product S n - 2 x S n - 2 , 


= | *Sn— 21 

\S n - 2 \ 

|S„_ 2 | 


„_ 2 \S n - 2 \ 2 


= 1. 


dw 


(43) 


Example 2. Consider the case where the joint distribution /u.v is separable of the form 


/u,v(u,v) = /u(u)/v(v), 


(44) 


i.e., U and V arc independent. Let the marginals be von Mises-Fisher distributions over the sphere 5 n _ 2 

/u(u) = (7 n _i(K)exp(«;/i r u), u 6 S n _ 2 , (45) 


in which p and n > 0 arc the location parameter and the concentration parameter, respectively, and 
C n - i(k) is a normalization constant, calculated as: 

K ( n _i )/2 _i 


C n - i{k) = 


(46) 


(27r)( n 1 )/ 2 /( fl _i)/ 2 _ 1 (K)’ 

where I m is the modified Bessel function of the first kind of order m. I m {x) can be computed up to the 
desired precision using the expansion: 

(. x/2) 2l+n 


OO 

Im{x) = ^ o lT(l + m + l)' 


(47) 


October 4. 2016 


DRAFT 



24 


in which T(.) is the gamma function. 

Due to exchangeability of /u(u), the only two feasible choices for //, arc p, = 1 and p, = — 1, where 
1 = [1,1,..., 1} T . Hence the joint distribution can be written as: 

/u,v(u, v) = /u(u)/v(v) 

= C n -i(ni) exp(/ti/xf u)C' n _i(K 2 ) exp(K 2 /r^v) 

= C n -l(K 1 )C n -l(K 2 )exp(Kln{u + K-2fJ.2 v ) (48) 

Assuming p 1 = ail and p 2 = a 2 l, where ai,a 2 € {—1,1}, we obtain: 

/u,v(u,v) (49) 

= C' n _i(Ki)C' r) ,_i(K 2 )exp (l T (a 1 K 1 u + a 2 K 2 v)) . 

This yields: 


I S n — 2 1 I 


'S „-2 


C n -i{^i)C n -i{n 2 ) 


exp ((cciKi + a 2 n 2 )l I w) dw 
= \Sn-2\C n -l{^l)C n -l(K 2 ) / 

Js n - 2 

exp ((cci^i + 02^2) 1 T w) dw 

\Sn-‘2\C n ~l{^l)Cn-l{^ 2 ) 

C'n-i(|a 1 «i + a 2 n 2 \) 

Therefore, using (46) and (47), J(/u,v) can be computed up to the desired precision. 

Further properties as well as intuitive interpretations of J(/u,v) have also been considered in [30]. 


(50) 


C. Proofs of propositions 

We first prove the following more general version of Prop. 1. This generalization can be useful in 
obtaining an approximate false discovery rates for SPARCS screening stage in cases where the underlying 
distribution of data is known. 

Proposition 5: Consider the linear model (13) for which Assumption 1 is satisfied. Let ILF = [Uf, Uf,..., U*] 
and ILF = [U ;,/ ] be (n — 1) x p and (n — 1) x 1 random matrices with unit norm columns. Let {p p } p 
be a sequence of threshold values in [ 0 , 1 ] such that p p —> 1 as p —> 00 and p( 1 — Pp )( n ~ 2 )/ 2 _ g n . 
Throughout this proposition N p y denotes the number of entries of the p x I vector G xy = [W') r] U v 
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whose magnitude is at least p. We have: 

lim E[N xy ] = lim £p,„, Pj) J(/u*,u») 

p—>• oo ,v p —S*-oo 

= C n lim (51) 

p —^OO 

where £ p , n , Pp = pPo{p,n) and ( n = e n a n /(n - 2). 

Assume also that q = o(p) and that the limit of average dependency coefficient satisfies lim^oo || Ap^i, 9 ||i = 
0. Then: 


P {N xy > 0) ->• 1 - exp(-A^), 


with 


A xy = lim E[N xy ]. 


(52) 


(53) 


Proof of Prop. 5: Let df denote the degree of vertex X t in part x of the graph Q p (G xy ). We have: 

p 

K y = Y. (f i- ^ 54 ) 

i— 1 

The following representation for d x holds: 


d x = I(XJ y e A(r,U?)), 


(55) 


where A(r, Uf) is the union of two anti-polar caps in S n - 2 of radius y/2(1 — p) centered at Uf and 
—Uf. The following inequality will be helpful: 

IE[df] = ( du f dv /uf,u»(u, v) (56) 

j Sn- 2 j A(r,\l) 


< Poa n M yx , (57) 

where M y ^ = max* ||/ub|u x lloo. and Po is a simplified notation for Po(p,n). Also for i f j we have: 

E [d x d x ] < Pfa 2 n M* y , (58) 

where M xy is a bound on the conditional joint densities of the form /u x ,u x |U“- 
Application of the mean value theorem to the integral representation (56) yields: 


|E[df] - PoJ(fu f ,iJy)\ < f yx Por, (59) 

where f yx = 2and is a bound on the norm of the gradient: 

M y \ 1 = max IIVuv/u^iu*( u^lufjHoo. (60) 
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Using (59) and the relation r = O ((1 — p ) 1 ^ 2 ) we conclude: 

|E [df] - PoJ(7^)\ < O (P 0 (l - P ) l/2 ) • (61) 

Summing up over i we conclude: 

\m x p y ] - ^n, P J(h^)\ < O (pPoil - p) 1/2 ) 

= o(^(l- p) 1 / 2 ), (62) 

where rjp V = ]>Pq. This concludes (51). 

To prove the second paid of the theorem, we use Chen-Stein method [4], Define the index set B xy (i) = 
A fq V {i) — {*}, 1 < i < p, where Nq V (i) is the set of indices of the (/-nearest neighbors of Uf. Note that 
| B xy (i)\ < q. Assume Np Xy is a Poisson random variable with E[Np Xy ] = E [Np V ], Using theorem 1 of 
[4], we have: 

2 max A \E(N xy € A) - P (N* xy € A)\ 

<61 + 62 + 63, (63) 


where: 


and 


where A q (i) 


61 = E E E [<l E K], 

*=1 i£B*y(i) 

h=ib e e w< 5 ]. 

i =1 jeB x y(i) 


{B x y(i)) c 


b 3 = ^TE[E [d x - E[d x ]\d x : j G A,(i)]] , 

2=1 

— {i}. Using the bound (57), E[df] is of order O(Pq). Therefore: 


h < 0{pkP$) = 0({rj xy ) 2 q/p). 


Since i (f B xy (i), applying (58) to each term of the summation (65) gives: 


62 < O(pqPy) = 0((p xy ) 2 q/p). 


(64) 

(65) 

( 66 ) 

(67) 

( 68 ) 
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Finally, to bound 63 we have: 


6 3 = ]Te[E [df-m]\XJA q{i) ]\ 


i =1 


E 


i.i 


Ag (i) | 




du? 


du y 


Sn — 2 


I A(r, uf) 


/uf,u»|u AgW (^, ul/ |uA,(t)) ~ /uf,u»(uf,u y ) 
Ajr,uv(uf,uw) 

/uf ,u» (uf, u y )fu AqW (u AgW ) 

<0(pPo||A^|| 1 ) = 0«||A^ j(? || 1 ). 

Therefore using bound (62) we obtain: 


(69) 


|P {N x p y > 0 ) - (1 - exp(—A xy ))| < 

\P(Np V > 0) — (1 — exp(—E[dV xy ]))| 

+ |exp(-E[iV xy ]) - exp(—A xy )| < 
bi + b 2 + b 3 + 0{\E[N*y] - A xy \) < 

h + b 2 + 63 + O (r, xy (l - p) 1 / 2 ) . (70) 

Combining this with the bounds on 61,62 and 63 , completes the proof of (52). □ 

In order to obtain stronger bounds, we prove the Prop. 1 under the weakly block-sparse assumption 
(17). However the proof for the general case where Assumption 3 is satisfied follow similarly. 

Proof of Prop. 1: Proof follows directly from Prop. 5 and Lemma 3 presented below. □ 

Lemma 3: Assume the hypotheses of Prop. 1. Assume also that the correlation matrix 12,, is of the 
weakly block-sparse from (17) with d x = o(p). We have: 

= U x (l + 0(d x /p)). (71) 

Moreover, the 2-fold average function J(/uj,u«) and the average dependency coefficient | Ap y ,^| satisfy 


J (/uj,u») — 1 + 0((k + d x )/p ), 


(72) 


l|A* y 


n,q l 


= 0 . 


(73) 


Furthermore, 


J (/uj,u«) = 1 + 0{m&x{d x /p,d xy /p}) 


(74) 


l|A^J|i = 0(d x /p). 


(75) 
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Proof of Lemma 3 : We have: 

U* = (U I (U a; ) T )“ 1 U :E D| u t )T(Ux (Ux)T) _ 2 U x- (76) 

By block sparsity of 12/„, LF' can be partitioned as: 

U* = [U^IEf], (77) 

where U x = [Uf, • • • ,XJ d ] are the U-scores coiTesponding to the dependent block of 12/,., and 11'’ = 
[Ui,--- ,v x p _ dx \ arc the remaining U-scores. 

Using the law of large numbers for a sequence of correlated variables (see, e.g.. Example 11.18 in 
[46]) since the off-diagonal entries of 12., that arc not in the dependent block converge to 0 as \i — j\ 
grows, we have 

—VU"(U") T -A E[Ui(Ui) t ] = -^—In- 1 . (78) 

p — a x n — 1 

Since the entries of 1 /dfff iW' ) r are bounded by one, we have: 

f = 0(d x /p), (79) 

P 

where 0(u) is an (n — 1) x (n — 1) matrix whose entries are 0(u). Hence: 

(U a: (U a: ) T )" 1 U a; = (U* (U x ) T + U a: (U a: ) T )' 1 U x 

= — (In-l + O {d x /p))- l W 

p 

77 — 1 

= -- W(l + 0(d x /p)). (80) 

P 

Hence, as p —> oo: 

(U a: ) T (U a; (U x ) T )“ 2 U :r = 

= (—) 2 (U*) T U"(1 + 0(d x /p)). (81) 

P 

Thus: 


D(u^) T (u x (u a: )' r )- 2 u x — 

= (1 + 0(4/^))). 

Combining (82) and (80) concludes (71). 


(82) 
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Now we prove relations (72)-(75). Define the partition {1,... ,p} = VWD 0 of the index set {1,... ,p}, 
where V = {i : Uf is asymptotically uncorrelated of ILF}- We have: 

= E+ (83) 

^ s,te{— 1 , 1 } *ex> iex> c 

and 

lAgUli - ;<£+ £)*£,<<)• <«) 

*ex> iev c 

But, J(/ S u?,tu») = 1 for i <E D and A p V n , q (i) = 0 for 1 < i < p. Moreover, we have \V C \ < d xy , where 
d xy = k + d x . Therefore,: 

= 1 + 0(d xy /p). (85) 

Moreover, since IP = IF (1 + 0(d x /p)), fjj x UB = /uf,u» (1 + 0(d x /p)). This concludes: 

= 1 + 0(max.{d x /p,d xy /p}), (86) 

and 

IIA^Hi =°(d x /p). (87) 

□ 

Proof of Lemma 1: By block sparsity of ili, s , HP can be partitioned as: 

IP = [HP, if], ( 88 ) 

where HP = [Uf, • • • . Uf ] arc the U-scores corresponding to the dependent block of 12 /,. s and Up = 
arc the remaining U-scores. Using relations (78) and (79) we have: 

—Vr) r = (uwf + uW) T ) 

p p 

= I n _i + (n - 1)0 (d x /p). (89) 

Noting that d x = o(p) the result follows. □ 

The following lemma will be useful in the proof of proof of Prop. 2. 

Lemma 4: Assume Z\, Z 2 and Z arc jointly elliptically contoured distributed random variables from 
which n joint observations arc available. Further assume that the n x 3 matrix Z of these observations 
has an elliptically contoured distribution of the form given in Assumption 1. Let p\ = Cor(Z, Zf) and 
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P 2 = Cor (Z, Z- 2 ). Also let r\ = SampCor(Z, Zf) and = SampCor(Z, Zf), be the corresponding sample 
correlation coefficients. Assume that |pi| > p 2 1 ■ Then, there exists C > 0 and N such that: 

IP{1 7*2 1 > |n|} < exp(—Cn), (90) 

for all n > N. 

We use the following lemma to prove Lemma 4. 

Lemma 5: Let U and V be two independent uniformly distributed random vectors on S n - 2 . For any 
fixed e > 0, there exists C > 0 such that: 

P{|U T V| > e} < exp(—Cn). (91) 


Proof of Lemma 5: Without loss of generality assume U = [1,0,... ,0] T . We have 


{|UfU 1 |>e} = {|n 1 |>e} ) 


(92) 


in which v\ is the first entry of the vector V. Using the formula for the area of spherical cap [40] we 
obtain 


IP{|Uf'U 1 | > e} = I\(n/2,1 /2), 


where A = 1 — e 2 , and 


L x (o, V) 


j^t a ~ l (l-t) b - l dt 

j£* l - 1 0--t) b - 1 dt 


is the regularized incomplete beta function. Note that: 


(93) 


(94) 


1 / I\{n/2, 1 / 2 ) = 

_ / A t^yyvr^tdt+ jj f( n - 2 )/ 2 /vT^idf 

/ 0 A i(«-2)/2/ 

_ t A 1 t^yyvr^tdt 

J A f( n - 2 )/ 2 /\/l — tdt 

>1 ! tiH n -W/VT=\dt 

Jq f (n_2)/2 /vT^Adf 

1 — A n / 2 

= 1 + -^75“ = <C>” (M) 

Therefore by letting C = — \ log(A) = — | log(l — e 2 ) we obtain 

P{|U^Ui| > e} < exp(-Cn). (96) 

□ 
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Proof of Lemma 4: Let Z = \Z 2 , Z\. Z} 1 . Assume Z follows an elliptically contoured density func¬ 
tion of the form /z(z) = \Y, z \~ l / 2 g ((z — /j, z ) T J] z ~ 1 (z — /x,)). Without loss of generality assume 
Var(Zi) = Var(Z 2 ) = Var(Z) = 1. Using a Cholesky factorization we can represent Z\.Z 2 and Z as 
lineal - combination of uncorrelated random variables W\, W 2 and W which follow a spherically contoured 
distribution: 


where 


and 


^2 


1 0 0 


w 2 

Zi 

= 

a b 0 

X 

Wi 

Z 


c d e 


W 


pi = ac + bd, 
P 2 = c, 
a 2 + b 2 = 1, 


c 2 + d 2 + e 2 = 1 . 


(97) 


(98) 

(99) 

( 100 ) 


( 101 ) 


Let W = [W 2 , Wi, W] T . Since W follows a spherically contoured distribution, it has a stochastic 
representation of the form W = 72U, where R has a marginal density fif r) = ah{r 2 )r 2 , in which a 
is a normalizing constant. Moreover U is independent of R and the distribution of U does not depend 
on the function h (see, e.g., Chapter 2 in [2] for more details about such stochastic representation). 
Now let Uf, U 2 and U 2 denote the U-scores corresponding to n independent samples of Z\, Z 2 and Z, 
respectively. Then under Assumption 1, as these U-scores are invariant to translation and scale on the 
n samples of Z\.Z 2 . Z, the joint distribution of the U-scores does not depend on g and without loss of 
generality the n samples can be assumed to be i.i.d. Gaussian [3]. Similarly, let Uj 1 '. U|’ and U" : denote 
the U-scores corresponding to lUi, U 2 and W, respectively. Using (97) we have the following relations: 


= U?, 

Uj = (oUS' + 6U5r)/||aU^ + 6 Uf|| 2 , 

U 2 = {cUf+ dUf+ eU u ’)/ 

||cU|' + dUj’ + eU w || 2 . (102) 
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Hence 

n = (U“) T uf = 

l 

HcUf 1 + dU^ + eXJ w \\ 2 \\aU™ + bUfh * 

(ac + bd + 6c(U^) T Uf + ad( U^) T U^ 

+ae(U w ) T U^ + &e(ir) T uA 

and 

r 2 = (U 2 ) t U1 

c + d(U^) T Uf + e(lT ) T U^ 

||cU?f + dUf + eXJ^ 11 2 ' 

Now let E = {|r 2 | > |ni|}. We have: 

E = { |U r U 2 | > |U t U!|} = 

| ||aU™ + 6Un| 2 |c + d(Ur) T U^ + e(U w ) T U^ 
> ac + 6d + 6c(U^) T U^ + ad(U^) T U^ + 
+ae(U^) T U|’ + &e(U w ) T U?' }. 

Since 

||aU?f + 6Ui’|| 2 = yj (aU^ + 6Uf ) T (aU^ + 6Uf) 

= y/a 2 + b 2 + 2ab(U%) T U' t £ 

= ^1 + 2a6(U|') T Uf 

< 1 + 2|a6|.|(U|’) T U“|, 


( 103 ) 


(104) 


(105) 


(106) 
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and, by using triangle inequality, we have 

E C |2|a6c|.|(U t 2 u ) T U ^| 2 + 

2|e|.|(U u ’) T U^|.|(U^) T U“| + 

\ad + 6 c|.|(U^) t U^| + |ae|.|(U' u ') T U/| + 

| 6 e|.|(U w ) T U^| > \ac + bd\ - |c|} 

C {2|a6c|.|(U^) T U/| 2 > \ac + bd\ - |c|}[J 
{2|e|.|(U-) T U^|.|(U^) T Un > \ac + bd\ - |c|}J 
{\ad + bc\.\(Uf) T Uf | > \ac + bd\ — \c DU 
{|ae|.|(U w ) T U7| > \ac + bd\ - |c|}[J 
{| 6 e|.|(U”') T U^| > \ac + bd\ - \c\} 

C {|(U^) T Un > (\ac + bd\ - |c|)/2|o6c|} (J 
{\(Vf) T Vf\>(\ac+bd\-\c\)/2\e\} (J 
{ITOfUfl > (\ac + bd\ - \c\)/\ad + bc\}\J 
{\(XJ w ) T XJf\>(\ac + bd\-\c\)/\ae\} J 

{|(U w ) T Un > (\ac + bd\ - |c|)/|6e|}. (107) 

Note that by assumption |ac + bd\ = |pi| > \p 2 \ = |c|. Now by Lemma 5 we get 

F(E) < 5exp(—cm), (108) 


with 


| ac + bd\ — |c| 

max{ 2 |a 6 c|, 2\e\, \ad + be |, \ae\, \be\} 

Pi ~ P2 
2 ’ 


where the last inequality is obtained via equations (98)-( 101). Letting C 
12 /(pi — p 2 ) we have 


(109) 

(Pi - / 02 )/3 and N = 


P (E) = P{|r* 2 1 > |ri|} < exp(— Cn), (110) 

for n > N. □ 

Proof of Proposition 2: Since P ( 7 To C S) increases as the size of the recovered set S increases, it suffices 
to prove the proposition for l = k. Define an auxiliary random variable X ax such that Cor(Y, X ax ) = 
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(max je { 1; ... \p y j\ + min ie7ro \p yi \) /2. Note that by Assumption 2 max i£{1 ... ;P }\ 7r[J |p M 

miiij e7ra | Pyi |• For l = k we have: 

P(vro^5)=P(7r 0 /5) 

< p( U (M < |SampCor(y,X ax )|} 

\ iS7r 0 

U {kj/jl > |SampCor(y,X ax )|} 

je{l,...,p}\7T 0 

< ^P(Vyi| < iSampCor^Xax)!) + 

iSTTo 

E p(kj/j| > |SampCor(y,Xax)|). 

je{l,...,p}\7T 0 

Now since Assumptions 1 and 2 are satisfied, by Lemma 4 there exist constants C t > 0,1 
a constant iV such that 

P(vr 0 /5) 

< exp(—Cjn) + exp(— Cjn) 

i&ITo je{l,...,p}\7T 0 

< pexp(-C' min n), Vn > AT, 

in which C' m in = mini<j< p Q = pmin/6. Hence by letting C = 2/C' m in = 12/pmin and n - 
have: 

P(tt 0 /5)< 

P 

and 

P( 7ro = 5) = l—P( tt 0 /5)>1--, 

P 

which completes the proof. 

Proof of Proposition 3: We only provide a proof sketch here. By Assumption 3 we have 

U-(IT) T = (In—t + 0(1)) . 
n — 1 

Therefore: 

(U^U*) 7 )- 1 = — (In-! +0(1)) . 

p 

Since columns of ILF have unit norm we obtain: 

(U*(U a! ) r )- 1 U a! = — -u®(l + o( 1)), 

P 


< Cor(y, X a f) < 

( 111 ) 

< i < p and 

( 112 ) 

= Clogp we 

(113) 

(114) 

□ 

(115) 

(116) 

(117) 
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and 

(U*) T (U x (U*) T )" 2 lLr = 

2 T (118 ) 

P 

This yields 

D(u") T (u x (u a: )' r )- 2 u x = (—~—) 2 Ip(l + o(l)), (119) 

which implies 

= 1^(1+ o(l)). (120) 

where, by the concentration assumption, with high probability the term o(l) decays to 0 exponentially 
fast. Therefore screening the entries of B'’ ?y or H XJ/ is asymptotically equivalent to selecting the support 
via thresholding the entries of (U X ) 7 ’U 3/ , i.e., the sample correlation coefficients. Therefore the proof 
follows from Prop. 2. □ 

Proof of Proposition 4: First we consider a two-stage predictor similar to the one introduced in Sec. II 
with the difference that the n samples which are used in stage 1 are not used in stage 2. Therefore, there 
are n and t — n samples used in the first and the second stages, respectively. We represent this two-stage 
predictor by n|(t — n). Similarly, n\t denotes the SPARCS algorithm which uses n samples at the first 
stage and all of the t samples at the second stage. The asymptotic results for the n\(t — n) two-stage 
predictor will be shown to hold as well for the n\t two-stage predictor. 

Using inequalities of the form (112) and the union bound, it is straightforward to see that for any 
subset 7T / 7To of k elements of {1, • • • , p}, the probability that vr is the outcome of variable selection 
via SPARCS, is bounded above by pc”, in which 0 < c T < 1 is a constant that is bounded above by 
exp(— C m i n ). The expected MSE of the n\ it — n ) algorithm can be written as: 

E[MSE] = J2 P(7r)E[MSE 7r ] + P(7T 0 )E[MSE 7ro ], (121) 

7reSf ,7r^7r 0 

where Sf, is the set of all k -subsets of {1, • • ■ . />}, P(7r) is the probability that the outcome of variable 
selection via SPARCS is the subset 7 r, and MSE^ is the MSE of OLS stage when the indices of the 
selected variables are the elements of 7 r. Therefore the expected MSE is upper bounded as below: 

E[MSE] < (1 -pcft)E[MSE*] + 

+ P J2 c ” E [ MSE -]> ( 122 ) 

7reSf ,7r^7r 0 
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where co is a constant which is upper bounded by exp(—C' nim )- It can be shown that if there is at least 
one wrong variable selected (ir / 7 To), the OLS estimator is biased and the expected MSE converges to 
a positive constant M n as (t — n) —> oo. When all the variables are selected correctly (subset 7 To), MSE 
goes to zero with rate 0(l/(t — n)). Hence: 

IE [MSE] < 

(! -P<%)0(l/{t-n)) +p c n M *< 

ir(zS%,Tr^ir 0 

(1 - pc%)C 2 /{t -n)+ / +1 CiC n , (123) 

where C, Cj and C 2 are constants that do not depend on a or p but depend on the quantities Yhje-n 0 a j an d 
min je7ro la.I/E^o l a t|- Note that C = max 7 rg 5 P )7r ^ 7ro c n < exp(— C m i n ). This quantity is an increasing 
function p mvn ■ 

On the other hand since at most t variables could be used in OLS stage, the expected MSE is lower 
bounded: 

IE [MSE] > 0(l/t). (124) 

It can be seen that the minimum of (123) as a function of n, subject to the constraint (1), happens 
for n = O(logt) if clog t < ^ ^ with c = —1/logC (therefore, si mi lar to C, c is increasing in p m ; n ); 
otherwise it happens for 0. If ©(log t) < ^ the minimum value attained by the upper bound (123) 
is 0(l/t) which is as low as the lower bound (124). This shows that for large t, the optimal number 
of samples that should be assigned to the SPARCS stage of the n\(t — n) predictor is n = Of log t). As 
f —^ oo, since n = O(logf), the MSE of the n\t predictor proposed in Sec. II converges to the MSE of 
the n|(t — n) predictor. Therefore, as t —> oo, n = O(logf) becomes optimal for the n|t predictor as 
well. □ 
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